Constructing A Chinese Chat Language Corpus with A Two-Stage Incremental Annotation Approach

نویسندگان

  • Yunqing Xia
  • Kam-Fai Wong
  • Wenjie Li
چکیده

Chat language refers to the special human language widely used in the community of digital network chat. As chat language holds anomalous characteristics in forming words, phrases, and non-alphabetical characters, conventional natural language processing tools are ineffective to handle chat language text. Previous research shows that knowledge based methods perform less effectively in processing unseen chat terms. This motivates us to construct a chat language corpus so that corpus-based techniques of chat language text processing can be developed and evaluated. However, creating the corpus merely by hand is difficult. One, this work is manpower consuming. Second, annotation inconsistency is serious. To minimize manpower and annotation inconsistency, a two-stage incremental annotation approach is proposed in this paper in constructing a chat language corpus. Experiments conducted in this paper show that the performance of corpus annotation can be improved greatly with this approach.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Two-Stage Incremental Annotation Approach to Constructing a Network Informal Language Corpus

Network Informal Language (NIL) refers to the special human language widely used in the community of digital network chat via platforms such as chat rooms/tools, mobile phone short message services (SMS), bulletin board systems (BBS), emails, etc. NIL holds anomalous characteristics in forming words, phrases, and non-alphabetical characters. This makes it difficult to handle NIL text by convent...

متن کامل

Chinese-English Parallel Corpus Construction and its Application

Chinese-English parallel corpora are key resources for Chinese-English cross-language information processing, Chinese-English bilingual lexicography, Chinese-English language research and teaching. But so far large-scale Chinese-English corpus is still unavailable yet, given the difficulties and the intensive labours required. In this paper, our work towards building a large-scale Chinese-Engli...

متن کامل

A Phonetic-Based Approach to Chinese Chat Text Normalization

Chatting is a popular communication media on the Internet via ICQ, chat rooms, etc. Chat language is different from natural language due to its anomalous and dynamic natures, which renders conventional NLP tools inapplicable. The dynamic problem is enormously troublesome because it makes static chat language corpus outdated quickly in representing contemporary chat language. To address the dyna...

متن کامل

Build a Large-Scale Syntactically Annotated Chinese Corpus

This paper reports on our research to build a large-scale Tsinghua Chinese Treebank (TCT). We propose a two-stage approach to reduce manual proofreading labors as much as possible. The insertion of an intermediate functional chunk level creates a good information bridge to link simple chunk annotation with detailed syntactic tree annotation. We describe our chunk and tree annotation schemes, fo...

متن کامل

Anomaly Detecting Within Dynamic Chinese Chat Text

The problem in processing Chinese chat text originates from the anomalous characteristics and dynamic nature of such a text genre. That is, it uses ill-edited terms and anomalous writing styles in chat text, and the anomaly is created and discarded very quickly. To handle this problem, one solution is to re-train the recognizer periodically. This costs a lot of manpower in producing the timely ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006